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Abstract 

Many applications, such as intermittent data assimilation, lead to a recursive application of Bayesian 
inference within a Monte Carlo context. Popular data assimilation algorithms include sequential Monte 
Carlo methods and ensemble Kalman filters (EnKFs). These methods differ in the way Bayesian inference 
is implemented. Sequential Monte Carlo methods rely on importance sampling combined with a resampling 
step while EnKFs utilize a linear transformation of Monte Carlo samples based on the classic Kalman filter. 
While EnKFs have proven to be quite robust even for small ensemble sizes, they are not consistent since 
their derivation relies on a linear regression ansatz. In this paper, we propose another transform method, 
which does not rely on any a prior assumptions on the underlying prior and posterior distributions. The 
new method is based on solving an optimal transportation problem for discrete random variables. 

1 Introduction 

This paper is concerned with a particular implementation of Monte Carlo methods for Bayesian inference and its 



application to filtering and intermittent data assimilation ( Jazwinski 1970j ). More specifically, we consider the 
problem of estimating posterior expectation values under the assumption that a finite-size ensemble {x^^^i 
from the (generally unknown) prior distribution ttj^/ is available. A standard approach for obtaining such 
estimators relies on the idea of importance sampling based on the likelihood 7ry(yo|a;f ) of the samples x{ with 



regard to a given observation (Doucet et al. 2001 Arulampalam et al. 2002[ Bain and Crisan 20081. If 
applied recursively, it is necessary to combine importance sampling with a resampling step such as monomial or 
systematic resampling ( Arularnpalam et aTj 2002 Kiinsch 2005 ) . Even more recently the ensemble Kalman filter 
(EnKF) has been introduced (Evensen 20061, which transforms the prior ensemble jccf j,^-, into a un i forml y 
weighted posterior ensemble {a;°}^^j^ using the classic Kalman update step of linear filtering (Jazwinski 19701. 
The EnKF leads, however, to a biased estimator even in the limit M oo (Lei and Bickel 20111. In this 



paper, we propose a non-random ensemble transform method which is based on finite-dimensional optimal 



transportation in form of linear programming (Strang 1986 Cotter and Reich 2012). We provide numerical 



and theoretical evidence that the new ensemble transform method leads to consistent posterior estimators. The 
new transform method can be applied to intermittent data assimilation leading to a novel implementation of 



particle filters. We demonstrate this possibility for the chaotic Lorenz-63 model (Lorenz 1963) 



An outline of the paper is as follows. In Section [2j importance sampling Monte Carlo is summarized in the 
context of Bayesian inference. Subsequently importance sampling is put into the context of linear programming 
in Section [3j This leads to a novel resampling method which maximizes the correlation between the prior 
and posterior ensemble members. We propose a further modification which turns the resampling step into 
a deterministic and linear transformation. Convergence of the proposed transformation step is demonstrated 
numerically by means of two examples. A theoretical convergence result is formulated for univariate random 
variables with compactly supported probability distributions. Finally, the application to sequential Monte 
Carlo methods is discussed in Section |4] and gives rise to a novel ensemble transform filter. Numerical results 
are presented for the Lorenz-63 model. 



2 Bayesian inference and importance sampling 

We summarize the importance sampling approach to Bayesian inference. Given a prior (or in the context of 
dynamic models, forecasted) random variable : M^, we denote its probability density function (PDF) 
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by 7Txf{x), X G M.^ , we consider the assimilation of an observed yo G with hkelihood function 7ry(i/|a:). 
According to Bayes' theorem the analyzed, posterior PDF is given by 

/ I N TrY{yo\x)nxf{x) 



/rjv TTY{yo\x)TTxf{x)dx' 

Typically, the forecasted random variable and its PDF are not available explicitly. Instead one assume 
that an ensemble of forecasts x{ € M.^ , i = 1, . . . , M, is given, which mathematically are considered as realiza- 
tions Xf (w), w S n, of Af independent (or dependent) random variables xf : f2 — > with law irxf- Then 
the expectation value = Exf[g] of a function g : — > M with respect to the prior PDF Trxf{x) can be 
estimated according to 



M 

i=l 



with realization 



M M 

-g{, = G{,H = ^E.9(^/H) = M^^^^'') 

1=1 i=l 



for the ensemble {a;f }f£i- The estimator is unbiased for any M > and its variance vanishes as M — > oo. 



Following the idea of importance sampling (Liu 2001 1, one obtains the following estimator with respect to 



the posterior PDF 7rx"(a;|yo) using the forecast ensemble: 



M 

all ^^m9ix{), 

with weights 

L;^=l7^Y(yo|a;■ ) 

The estimator is no longer unbiased for finite M but remains consistent. Here an estimator is called consistent 
if the root mean square error between the estimator g\.^ and the exact expectation value g"" vanishes as M — oo. 



3 An ensemble transform method based on linear programming 

Alternatively to importance sampling, we may attempt to transform the samples x{ ~ x[{uj) with x{ ~ nxf 
into samples xf which follow the posterior distribution TTx<'{x\yo)- Then we are back to an estimator 



1 

9m 



M 



M 
j=i 

with equal weights for posterior expectation values. For univariate random variables X^ and X'^ with PDFs 
■Kxf and TTx" , respectively, the transformation is characterized by 

Fx^{x'i) = Fxf{xl), (2) 

where Fxf{x) and i^x" denote the cumulative distribution functions oi X^ and X"", respectively, e.g. 

Fxf{x)^ I 7rxf{x')dx'. 



Eq. ^ requires knowledge of the associated PDFs and its extension to multivariate random variables is non- 
trivial. In this section, we propose an alternative approach that does not require explicit knowledge of the 
underlying PDFs and that easily generalizes to multi-variate random variables. To obtain the desired trans- 
formation we utilize the idea of optimal transportation dVillani 2003 1 with respect to an appropriate distance 



d{x,x') in M^. More precisely, we first seek a coupling between two discrete random variables X-^ : fl' ^ X 
and X° : ri' — > A" with realizations in A" = {x{, . . . ,a;{^} and probability vector = {1/M, . . . , for 
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X-'^ and = {wi, . . . ,wm)'^ for X", respectively. A coupling between X-'^ and X° is a M x M matrix T with 
non-negative entries tij = {T)ij > such that 

M M 

Y,tij = l/M Y,u,=w,. (3) 

i=i j=i 

We now seek the coupling T* that minimizes the expected distance 

The desired coupling T* is characterized by a linear programming problem ( Strang 1986 1 . Since ([s]) leads to 
2M — 1 independent constraints the matrix T* contains at most 2M — 1 non-zero entries. 
In this paper, we use the squared Euclidean distance, i.e. 

d{x{,xf^) = \\x{-xf\\\ (4) 

We recall that minimizing the expected distance with respect to the squared Euclidean distance is equivalent 
to maximizing Ex/x» [(2;^)"^a;°] since 

Ex/X" [||^^ - = Ex/ [W^ff] + Ex» [Wx^f] 2Ex/x" [{x^fx'^]. 

Hence minimizing the expected distance is equivalent to maximizing correlation. 
We next introduce the Markov chain P G jjMxa/ ^ ^-^^ 

P = M T* 

with the property that 



= Pp 



/ 



Given realizations a;J, j — 1^ . . . , M from the prior PDF, a Monte Carlo resampling step proceeds now as follows: 
Define discrete random variables 

\ PMj I 

for j = 1, . . . , M . Here pij denotes the (i, j)th entry of P. Note that the random variables X|, = 1, . . . , Af , are 
neither independent nor identically distributed. A new ensemble of size M is finally obtained via = X°(a;), 
j = 1, . . . ,M . This ensemble of equally weighted samples allows for the approximation of expectation values 
with respect to the posterior distribution ttx" (a;|2/o)• 



The outlined procedure leads to a particular instance of resampling with replacement ( Arulampalam et al. 



2002 Kiinsch 2005 1 . The main difference to techniques such as monomial or systematic resampling is that 
the resampling is chosen such that an expected distance d{x-^,x°') between the prior and posterior samples is 
minimized. 

We now propose a further modification which replaces the random resampling step and avoids obtaining 
multiple copies in the analyzed ensemble {xf}. The modification is based on the observation that 

M 

x'^=Ex^[x]=Y.P^,x{. 

i=l 

We use this result to propose the deterministic transformation 

x'; = x;, (6) 

J = 1, . . . , M. The idea is that 

M 



M 
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Figure 1; Prior and posterior xf realizations from the transform method for AI = 10. 



stiU provides a consistent estimator for Ex<'[(?] as M — > oo. For the special case g{x) = a; it is easy to verify 
that indeed 

M M M M 



j = l j = l 1=1 i,j 1=1 



Before investigating the theoretical properties of the proposed transformation ([6| we consider two examples 
which indicate that ^ lead to a consistent approximation to ^ in the limit M — oo. 

Example. We take the univariate Gaussian with mean x — 1 and variance cr^ = 2 as prior random variable 
X^. Realizations of X-^ are generated using 

x{ = V2err\2u.-ll ^ ^ + ^ 

for i = 1, . . . , M. The likelihood function is 

, . . 1 f-iy — xY 

'^Y{y\x) = —= exp 
\/47r 



with assumed observed value = 0.1. Bayes' formula yields a posterior distribution which is Gaussian with 
mean x — 0.55 and variance = 1. The prior and posterior realizations from the transform method are shown 
for M = 10 in Figure [T] We also display the analytic transform, which is a straight line in case of Gaussian 
distributions, and the approximate transform using linear programming in Figure [2j The structure of non-zero 
entries of the Markov chain matrix P for M = 40 is displayed in Figure [Sj which shows a banded structure of 
local interactions. More generally, one obtains the posterior estimates for the first four moments displayed in 
Table [1} which indicate convergences as Af — oo. 

Example. As a further (non-Gaussian) example we consider a uniform prior on the interval [0, 1] and use 
samples x{ — Ui with the UiS as defined in the previous example. Given the observed value y^ = 0.1, the 
posterior PDF is 

-(.-o.i)V4 a;e[o,l] 



7rx»(x|0.1) 



0.9427.. 

else 



The resulting posterior mean is x ~ 0.4836 and its variance k, 0.0818. The third and fourth moments are 
0.0016 and 0.0122, respectively. The transform method yields the posterior estimates for the first four moments 
displayed in Table [2] which again indicate convergences as M — >• oo. 
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Figure 2: Exact and numerical ensemble transform map for M = 10. The Gaussian case leads to the exact 
transformation being linear. The numerical approximation deviates from linearity mostly in its both tails. 
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Figure 3: Non-zero entries in the matrix P for M = 40, i.e. the support of the coupling. There are a total of 
2M — 1 = 79 non-zero entries. The banded structure reveals the spatial locality and the cyclical monotonicity 



(Villani 2003) of the resampling step. 
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Table 1: Estimated posterior first to fourth-order moments from the ensemble transform method applied to a 
Gaussian scalar Bayesian inference problem. 



X K[(X-xf] E[{X-xf] 

M = 10 0.5361 1.0898 -0.0137 2.3205 

M = 40 0.5473 1.0241 0.0058 2.7954 

M = 100 0.5493 1.0098 -0.0037 2.9167 



Table 2: Estimated posterior first to fourth-order moments from the ensemble transform method applied to a 
non-Gaussian scalar Bayesian inference problem. 







X 


a2 


E,[{X-xf] 


E[(X-i)4] 


M 


= 10 


0.4838 


0.0886 


0.0014 


0.0114 


M 


= 40 


0.4836 


0.0838 


0.0016 


0.0121 


M 


= 100 


0.4836 


0.0825 


0.0016 


0.0122 



Theorem. Assume that the ensemble {a;f }f£]^ is a realization of M identically and independent distributed 
univariate random variables X( : r2 — >■ [a, 6] C K with PDF tt^ / . Without restriction we may assume that the 
ensemble members x{ G [a, b] are mutually distinct and that they have been sorted such that 

f f f f 

Xi ^ ^2 ^ ^ -^M-l ^ •''Af- 

We finally assume that there is a constant S > such that tt^/ (x) > 5 for all x G [a, b]. Then xf, given by (|6|, 
converges to xf, defined by ([2]), in probability as Af — > cx). 

Proof. Given an ei > 0, we chose an integer L — L{ei) sufficiently large such that truncated weights 

w, e {0,l/iLM),2/{LM),...,l} 

satisfying 

\wi — wA 



l/M 

can be defined for i = 1, . . . , M. Here the weights Wi are defined by ([I]). 

It follows that the linear programming problem with weights Wi replaced by Wi leads to a matrix P with 
maximal L non-zero entries in each column and those non-zero entries cluster into a single block due to cyclical 
monotonicity of optimal transport plans ( [Villani 2003 ) . See also Figure [3j 



Next we decompose the interval [a, b] into J disjunct intervals Ij such that 

'nxi{x)dx = J. 
Given an £2 > 0, we now select a, J — J{s2) such that 

/ dx < £2- 

This is possible since ttjsj-/ is bounded from below. 

Large deviation results for monomial resampling (see, for example, Lemma 1 in Devroye and Gyorfi (1985)) 
imply that the probability of having less than L members of the ensemble {x{}fL^ in each of the subintervals li 
decays exponentially fast as Af — )■ 00. Hence the standard deviation of Xf, as defined by ([5]), can be bounded 
by £2. Taking the fimit £,; — > 0, i = 1, 2, we obtain the convergence in probabihty of — xf to xf, as defined 



6 



by (B. The limit can be achieved, for example, by choosing J = 0(Af ^Z^) and K = 0{M^/'^). 



□ 



Remark. Cyclically monotonicity for optimal couplings /x between two PDFs ttx/ and ttx" states that any 
sequence of points {x{ ,y°-), i — 1, . . . ,1, from the support spt(/x) of /x, i.e. {x{,xf) £ spt(/^), satisfies 



^d(xf,<)<^d(xf,<(,)) 



i=l 



where a denotes any permutation of the indices ie{l,2,...,/}( Villani , 2003 1 . Cyclically monotonicity applies 
in particular to any sequence of pairs i — 1,. . . , M, obtained from the resampling step For the 

Euclidian distance Q, this property carries over to the sample means and, hence, to pairs {x{,xf) obtained 
from the transformation This observation together with the results from McCann ( 1995[) shoul d allow for 
a more direct convergence proof. In particular, the techniques leading to Theorem 6 in McCann ( 1995 ) are 
directly applicable. 



One may replace the uniform probabilities in p-f by an appropriate random vector = {w(, . . . , it;^^)^, i.e. w( > 
and ^/ = 1- To clarify the notations we write p"" — (wj, . . . ,'w'\[)'^ for the posterior weights according 

to Bayes's formula. The linear programming task is being adjusted accordingly and one obtains an optimal 
coupling T* and an induced Markov chain P with entries 



Hence the transform method is now defined by 



M 



(7) 



and the posterior ensemble mean satisfies 

X]^.j — 



1=1 



M M 



f ^ij f 

—xj 

j=l j = l i=l 



M 



as desired. More generally, posterior expectation values are given by 

M 

Tm 



E^/3«) 



4 Application to sequential data assimilation 

We now apply the proposed ensemble transformation (ET) method to sequential state estimation for ordinary 
differential equation models 

X = fix) (8) 

with known PDF ttq for the initial conditions a;(0) G at time t = 0. Hence we treat solutions x{t) as 
realizations of the random variables Xt, t > 0, determined by the flow of ^ and the initial PDF ttq. 

We assume the availability of observations y{tk) G at discrete times tk — fcAtobs, fc > 0, in intervals of 
Aiobs > 0. The observations satisfy the forward model 

Y{tk) = h{x^cf{tk)) + Sfe, 

where : — >■ represent independent and identically distributed centered Gaussian random variables with 
covariance matrix i? € M^^-^, /i : ^- is the forward map, and Xrc{{t) G denotes the desired reference 
solution. The forward model gives rise to the likelihood 
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Figure 4: Time mean RMS errors for the Lorenz-63 model in the setting of Anderson (20101 for an ensemble 



square root filter (ESRF) and the new ensemble transform (ET) filter for increasing ensemble sizes M. 



A particle filter starts from an ensemble {xi{Q)}fLi of M realizations from the initial PDF ttq. We evolve this 
ensemble of realizations under the model dynamics ([s]) till the first observation ?/obs(Afobs) becomes available 
at which point we apply the proposed ET method to the forecast ensemble members x{ = Xi(Atobs)- If one 
furthermore collects these prior realizations into a. N x M matrix 

X-^ = [x( ■ ■ ■ 

then, for given observation yo = ^(Atobs), the ET method ([t]) leads to the posterior realizations simply given 

by 

x« = x^p, [x^••xl,] = x^ (9) 

where P is the Markov chain induces by the associated linear programming problem. The analysed ensemble 
members a:°, i = 1, . . . , M, are now being used as new initial conditions for the model Q and the sequence of 
alternating between propagation under model dynamics and assimilation of data is repeated for all A; > 1. 

It should be noted that a transformation similar to ^ arises from the ensemble square root filter (ESRF) 
(Evensen 2006). However, the transform matrix 



P e 



■IxM 



used here is obtained in a completely different 



manner and does not relly on the assumption of the PDFs being Gaussian. We mention the work of |Lei and] 
Bickel (2011 ) for an alternative approach to modify EnKFs in order to make them consistent with non-Gaussian 



distributions. 

We now provide a numerical example and compare an ESRF implementation with a particle filter using the 
new ET method. 



Example. We consider the Lorenz-63 model ( Lorenz[ 1963) 



x = cr(y-x), 

y = x(p-z)-y, 
z = xy — /3z 



in the parameter and data assimilation setting of Anderson (2010). In particular, the state vector is a; = 
(x, y, z)'^ e M.^ and we observe all three variables every A^obs = 0.12 time units with a measurement error vari- 
ance i? = 8 in each observed solution component. The equations are integrated in time by the implicit midpoint 
rule with step-size At = 0.01. We implement an ESRF ( ^Evensen, ,2006 ) and the new ET filter for ensemble sizes 
M = 10, 20, 40, 60, 80, 100. The results for both methods use an optimized form of ensemble infiation. The ET 
nevertheless leads to filter divergence for M = 10 while the ESRF is stable for all given choices of M . The time 
mean root mean square (RMS) errors over 2000 assimilation steps can be found in Fig. |4] It is evident that 
the new ET filter leads to much lower RMS errors for all M > 20. The results also compare favorable to the 
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ones displayed in Anderson ( 2010[) for the rank histogram filter (RHF) Anderson (2010) and the EnKF with 
perturbed observations (Evensen 12006). 



5 Conclusions 

We have explored the application of linear programming and optimal transportation to Bayesian inference 
and particle filters. Numerical evidence indicates that the proposed ET method allows to reproduce poste- 
rior expectation values in the limit M — > oo. It would be of interesting to prove the convergence of the 



ET method to solutions of the associated continuous optimal transportation problem (Villani 2003 2009) for 



general multivariate random variables using cyclical monotonicity arguments. The application of continuous 



optimal transportation to Bayesian inference has been discussed by Moselhy and Marzouk (2012), Reich (2011 



2012), Cotter and Reich (2012). However, a direct application of optimal transportation to Bayesian inference 
in high-dimensional state spaces M.^ seems currently out of reach. It remains to investigate what modifications 



are required (such as localization (Evensen 2006)) in order to implement the proposed ET method even if the 



ensemble sizes M are much smaller than the dimension of state space N (or the dimension of the attractor of 
([S]) in case of intermittent data assimilation) . 
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